STATS 32 Session 7: Importing your own data and factors

Kenneth Tay

Oct 15, 2019

Reminder!

Project proposals are due TOMORROW 16 Oct (Wed), 23:59:59

Recap of week 3

Function syntax

The function call is the most important piece of R syntax: every operation in R, including operators such as + and <-, is a function call underneath.

A function call consists of:

function_name(<inputs to the function>,
              <arguments which change 
              how the function operates>)
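
The idea that operators are function calls underneath can be checked directly in base R. This is a minimal sketch; the variable names a, x and second are ours, chosen for illustration.

```r
# Operators are ordinary functions; backticks let us call them by name.
a <- `+`(1, 2)           # same as 1 + 2

x <- c(10, 20, 30)
second <- `[`(x, 2)      # same as x[2]

print(a)
print(second)
```

Writing 1 + 2 is just a more readable spelling of the same call, which is why the general function_name(...) pattern covers all of R.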

Function example

x <- c(-5, -3, -1, 1, 3, NA)
mean(x)
## [1] NA

x <- c(-5, -3, -1, 1, 3, NA)
mean(x, na.rm = TRUE)
## [1] -1

%>% syntax with dplyr

Take the mtcars dataset, select just the wt and mpg columns, then select rows with mpg < 15

library(dplyr)  # provides %>%, select() and filter()
mtcars %>% 
    select(wt, mpg) %>% 
    filter(mpg < 15)
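
To see what the pipe is doing, it may help to compare it with the equivalent nested call. This is a sketch assuming dplyr is installed; x %>% f(y) is evaluated as f(x, y). The names piped and nested are ours.

```r
library(dplyr)

# Pipe version: read left to right, top to bottom.
piped <- mtcars %>%
    select(wt, mpg) %>%
    filter(mpg < 15)

# Nested version: same computation, read inside out.
nested <- filter(select(mtcars, wt, mpg), mpg < 15)

# Both approaches give the same data frame.
print(identical(piped, nested))
```

The nested form becomes hard to read as more steps are added, which is the main reason to prefer the pipe.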

+ syntax with ggplot2

library(ggplot2)
ggplot(data = mtcars, mapping = aes(x = wt, y = hp)) +
    geom_point() +
    labs(title = "Horsepower vs. Weight", x = "Weight", 
         y = "Horsepower") +
    theme_classic()

Agenda for today

“Official” cheat sheet for readr available here.

Where does your data live?

Filepath example (Mac)

(Source: iDB)

Filepath example (Windows)

When I download a file, where does it go?

In Chrome: go to chrome://settings/downloads to find out

File paths

Working directories in R

How can I change my working directory in RStudio?

  1. You can issue the command setwd("<path of new directory>")
  2. In the menu bar, click Session > Set Working Directory, then click one of the options in the sub-menu
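
The effect of the working directory on file paths can be sketched in base R as follows. The file name example.csv and its contents are made up for illustration, and the sketch works inside a temporary folder so that none of your own files are touched.

```r
old_wd <- getwd()        # remember where we started

tmp <- tempdir()         # a scratch folder provided by R
setwd(tmp)               # change the working directory

# Write a tiny CSV file (hypothetical data) into the working directory.
writeLines(c("name,points", "A,10", "B,12"), "example.csv")

# A relative path such as "example.csv" is looked up in the working directory.
df <- read.csv("example.csv")
print(df)

setwd(old_wd)            # restore the original working directory
```

If read.csv() reports that a file cannot be found, checking getwd() is usually the first step.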

Factors

Why use factor variables instead of character variables?

Reason 1: Character variables don’t protect you from typos

x <- c("Dec", "Apr", "Jam", "Mar")

Reason 2: Character variables don’t sort in a useful way

x <- c("Dec", "Apr", "Jan", "Mar")
sort(x)
## [1] "Apr" "Dec" "Jan" "Mar"

Factor variables can fix both of these easily.

How to convert a character variable to a factor variable?

month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun", 
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
x <- c("Dec", "Apr", "Jam", "Mar")

y1 <- factor(x, levels = month_levels)
y1
## [1] Dec  Apr  <NA> Mar 
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

library(readr)  # parse_factor() comes from readr
y2 <- parse_factor(x, levels = month_levels)
## Warning: 1 parsing failure.
## row col           expected actual
##   3  -- value in level set    Jam
y2
## [1] Dec  Apr  <NA> Mar 
## attr(,"problems")
## # A tibble: 1 x 4
##     row   col expected           actual
##   <int> <int> <chr>              <chr> 
## 1     3    NA value in level set Jam   
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

sort(y1)
## [1] Mar Apr Dec
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
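
Putting the two reasons together, one possible workflow is to use the NA produced by factor() to locate the typo, fix it, and then sort. This is a sketch in base R only; the names bad, x_fixed, y_fixed and sorted are ours.

```r
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
x <- c("Dec", "Apr", "Jam", "Mar")
y1 <- factor(x, levels = month_levels)

# Values that were not missing in x but became NA in y1 are not valid levels.
bad <- x[is.na(y1) & !is.na(x)]
print(bad)

# Fix the typo by hand, then convert again.
x_fixed <- replace(x, x == "Jam", "Jan")
y_fixed <- factor(x_fixed, levels = month_levels)

# Sorting now follows the order of the levels, not the alphabet.
sorted <- sort(y_fixed)
print(sorted)
```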

Today’s dataset: NBA data

(Source: Twitter)

Optional material

Different packages for working with different data formats

Factors under the hood

y1
## [1] Dec  Apr  <NA> Mar 
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
str(y1)
##  Factor w/ 12 levels "Jan","Feb","Mar",..: 12 4 NA 3
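
The str() output suggests that a factor is stored as integer codes plus a vector of level labels. A short base R sketch of what that implies, including a common pitfall when the labels look like numbers (the vector f is a made-up example):

```r
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
y1 <- factor(c("Dec", "Apr", "Jam", "Mar"), levels = month_levels)

codes <- as.integer(y1)          # the integer codes: 12 4 NA 3
labels <- levels(y1)[codes]      # look the codes up in the levels

# Pitfall: converting a factor of number-like labels gives the codes,
# not the numbers themselves.
f <- factor(c("10", "20", "10"))
wrong <- as.numeric(f)                   # 1 2 1
right <- as.numeric(as.character(f))     # 10 20 10

print(codes)
print(labels)
print(wrong)
print(right)
```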

Ordered & unordered factors
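
As a rough illustration of the difference, here is a base R sketch using a made-up sizes vector. An ordered factor records that its levels have a ranking, so comparisons such as < and functions such as max() make sense for it.

```r
sizes <- c("medium", "small", "large", "small")
size_levels <- c("small", "medium", "large")

unordered_sizes <- factor(sizes, levels = size_levels)
ordered_sizes <- factor(sizes, levels = size_levels, ordered = TRUE)

print(is.ordered(unordered_sizes))   # FALSE
print(is.ordered(ordered_sizes))     # TRUE

# Comparisons use the level order, not the alphabet.
smaller_than_large <- ordered_sizes < "large"
biggest <- max(ordered_sizes)

print(smaller_than_large)
print(biggest)
```

For an unordered factor the same comparison is not meaningful: R returns NA with a warning.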